Introduction

This is a loan dataset from Prosper. I chose 15 variables to analyse: EstimatedReturn, ProsperScore, Occupation, IsBorrowerHomeowner, TotalCreditLinespast7years, TotalInquiries, DelinquenciesLast7Years, TotalTrades, DebtToIncomeRatio, IncomeRange, TotalProsperPaymentsBilled, ProsperPrincipalBorrowed, ProsperPrincipalOutstanding, LoanOriginalAmount, and MonthlyLoanPayment. Many columns contain a large number of NA values. Rather than dropping incomplete rows from the dataset, I remove the NAs separately in each variable's analysis so that as many observations as possible are used.
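
As a rough illustration of why the NAs are handled per variable, the sketch below compares the two strategies on a tiny made-up data frame. The object `loans` and its values are hypothetical stand-ins, not the Prosper data.

```r
# Toy data frame standing in for the loan data (values are made up).
loans <- data.frame(
  EstimatedReturn    = c(NA, 0.05, NA, 0.06, 0.09),
  TotalTrades        = c(11, 29, NA, 26, 39),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(loans))

# Strategy A: drop every row that has an NA in any column.
rows_complete <- nrow(na.omit(loans))

# Strategy B: drop NAs one variable at a time.
rows_per_variable <- colSums(!is.na(loans))

print(na_counts)
print(rows_complete)
print(rows_per_variable)
```

In this toy case strategy A keeps fewer rows than strategy B keeps for any single column, which is the motivation for the per-variable approach.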

Dataset

packages loaded

library(ggplot2)
library(dplyr)
library(gridExtra)
library(grid)
library(SparseM)

data prepared

# load dataset
pld <- read.csv("D:/Udacity/prosperLoanData.csv")

dim(pld)
## [1] 113937     81
# a dataset including all 15 selected variables
pld.ana <- pld[, c("EstimatedReturn", "ProsperScore", "Occupation", "IsBorrowerHomeowner",
                   "TotalCreditLinespast7years", "TotalInquiries", "DelinquenciesLast7Years",
                   "TotalTrades", "DebtToIncomeRatio", "IncomeRange",
                   "TotalProsperPaymentsBilled", "ProsperPrincipalBorrowed",
                   "ProsperPrincipalOutstanding", "LoanOriginalAmount", "MonthlyLoanPayment")]

str(pld.ana)
## 'data.frame':    113937 obs. of  15 variables:
##  $ EstimatedReturn            : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperScore               : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ Occupation                 : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ IsBorrowerHomeowner        : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ TotalCreditLinespast7years : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ TotalInquiries             : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ DelinquenciesLast7Years    : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ TotalTrades                : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ DebtToIncomeRatio          : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ TotalProsperPaymentsBilled : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed   : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding: num  NA NA NA NA 9948 ...
##  $ LoanOriginalAmount         : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment         : num  330 319 123 321 564 ...
dim(pld.ana)
## [1] 113937     15
# 6 occupations selected
occupation_sel <- c("Accountant/CPA", "Administrative Assistant", "Computer Programmer", "Executive", "Sales - Commission", "Teacher")
pld.Occupation <- pld.ana[pld.ana$Occupation %in% occupation_sel, ]
# drop unused levels, leaving the 6 selected occupations
pld.Occupation$Occupation <- factor(pld.Occupation$Occupation)

# variable "ProsperScore" to factor
pld.ana$ProsperScore <- factor(pld.ana$ProsperScore)

The original data contains 81 variables and 113937 observations, and the analysed data contains 15 variables and 113937 observations. The “TotalProsperLoans” factor has levels 0-8. The “ProsperScore” factor has levels 0-11.
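
A small sketch of why the occupation factor is rebuilt after filtering: subsetting a factor keeps all of its original levels, so empty categories would otherwise still appear in tables and bar charts. The vector `occupation` below is a made-up example, and `droplevels()` is shown as an alternative to calling `factor()` again.

```r
# Made-up occupations; the real column has many more levels.
occupation <- factor(c("Teacher", "Executive", "Other", "Teacher", "Professional"))
selected <- c("Teacher", "Executive")

# Keep only the selected occupations.
kept <- occupation[occupation %in% selected]

# The unused levels are still attached to the subset.
levels_before <- nlevels(kept)

# Remove levels that no longer occur in the data.
kept <- droplevels(kept)
levels_after <- nlevels(kept)

print(c(before = levels_before, after = levels_after))
```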

Univariate Analysis

# data summary
summary(pld.ana)
##  EstimatedReturn   ProsperScore                      Occupation   
##  Min.   :-0.183   4      :12595   Other                   :28617  
##  1st Qu.: 0.074   6      :12278   Professional            :13628  
##  Median : 0.092   8      :12053   Computer Programmer     : 4478  
##  Mean   : 0.096   7      :10597   Executive               : 4311  
##  3rd Qu.: 0.117   5      : 9813   Teacher                 : 3759  
##  Max.   : 0.284   (Other):27517   Administrative Assistant: 3688  
##  NA's   :29084    NA's   :29084   (Other)                 :55456  
##  IsBorrowerHomeowner TotalCreditLinespast7years TotalInquiries   
##  False:56459         Min.   :  2.00             Min.   :  0.000  
##  True :57478         1st Qu.: 17.00             1st Qu.:  2.000  
##                      Median : 25.00             Median :  4.000  
##                      Mean   : 26.75             Mean   :  5.584  
##                      3rd Qu.: 35.00             3rd Qu.:  7.000  
##                      Max.   :136.00             Max.   :379.000  
##                      NA's   :697                NA's   :1159     
##  DelinquenciesLast7Years  TotalTrades     DebtToIncomeRatio
##  Min.   : 0.000          Min.   :  0.00   Min.   : 0.000   
##  1st Qu.: 0.000          1st Qu.: 15.00   1st Qu.: 0.140   
##  Median : 0.000          Median : 22.00   Median : 0.220   
##  Mean   : 4.155          Mean   : 23.23   Mean   : 0.276   
##  3rd Qu.: 3.000          3rd Qu.: 30.00   3rd Qu.: 0.320   
##  Max.   :99.000          Max.   :126.00   Max.   :10.010   
##  NA's   :990             NA's   :7544     NA's   :8554     
##          IncomeRange    TotalProsperPaymentsBilled
##  $25,000-49,999:32192   Min.   :  0.00            
##  $50,000-74,999:31050   1st Qu.:  9.00            
##  $100,000+     :17337   Median : 16.00            
##  $75,000-99,999:16916   Mean   : 22.93            
##  Not displayed : 7741   3rd Qu.: 33.00            
##  $1-24,999     : 7274   Max.   :141.00            
##  (Other)       : 1427   NA's   :91852             
##  ProsperPrincipalBorrowed ProsperPrincipalOutstanding LoanOriginalAmount
##  Min.   :    0            Min.   :    0               Min.   : 1000     
##  1st Qu.: 3500            1st Qu.:    0               1st Qu.: 4000     
##  Median : 6000            Median : 1627               Median : 6500     
##  Mean   : 8472            Mean   : 2930               Mean   : 8337     
##  3rd Qu.:11000            3rd Qu.: 4127               3rd Qu.:12000     
##  Max.   :72499            Max.   :23451               Max.   :35000     
##  NA's   :91852            NA's   :91852                                 
##  MonthlyLoanPayment
##  Min.   :   0.0    
##  1st Qu.: 131.6    
##  Median : 217.7    
##  Mean   : 272.5    
##  3rd Qu.: 371.6    
##  Max.   :2251.5    
## 
# histograms 
range(pld.ana$EstimatedReturn, na.rm = TRUE)
## [1] -0.1827  0.2837
ggplot(aes(EstimatedReturn), data = subset(pld.ana, !is.na(EstimatedReturn))) +
  geom_histogram()

EstimatedReturn ranges from -0.1827 to 0.2837. From the histogram, we can see that most returns fall between 0 and 0.2.

# adjusted histogram
plt_EstimatedReturn <- ggplot(aes(EstimatedReturn), data = subset(pld.ana, !is.na(EstimatedReturn))) +
  geom_histogram(binwidth = 0.001) +
  xlim(0, 0.2)
plt_EstimatedReturn

The most frequent return is around 0.125, and the distribution appears to be multimodal.
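
One way to check an impression of multimodality without relying on the eye alone is to look for local peaks in the binned counts. The following is only a sketch on made-up values (the vector `returns` is hypothetical), and the strict "higher than both neighbours" rule is an assumption that would be sensitive to noise and bin width on real data.

```r
# Made-up returns clustered around two values; not the Prosper data.
returns <- c(rep(0.0775, 2), rep(0.0825, 5), rep(0.0875, 2),
             rep(0.1225, 3), rep(0.1275, 8), rep(0.1325, 1))

# Bin the values without drawing anything.
h <- hist(returns, breaks = seq(0, 0.2, by = 0.005), plot = FALSE)
counts <- h$counts
n <- length(counts)

# An interior bin is a local peak if it is strictly higher than both neighbours.
interior <- 2:(n - 1)
is_peak <- c(FALSE,
             counts[interior] > counts[interior - 1] &
               counts[interior] > counts[interior + 1],
             FALSE)

# Midpoints of the bins flagged as peaks.
peak_mids <- h$mids[is_peak]
print(peak_mids)
```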

summary(pld.ana$EstimatedReturn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.074   0.092   0.096   0.117   0.284   29084
# histogram of TotalCreditLinespast7years
range(pld.ana$TotalCreditLinespast7years, na.rm = TRUE)
## [1]   2 136
plt_TotalCreditLinespast7years <- ggplot(aes(TotalCreditLinespast7years), data = subset(pld.ana, !is.na(TotalCreditLinespast7years))) +
  geom_histogram(binwidth = .2) +
  scale_x_continuous(limits = c(0, 120), breaks = seq(0, 120, 20))
plt_TotalCreditLinespast7years 

The mode of TotalCreditLinespast7years is slightly above 20, and the distribution is close to normal, with a longer right tail (the mean, 26.75, exceeds the median, 25).

summary(pld.ana$TotalCreditLinespast7years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   17.00   25.00   26.75   35.00  136.00     697
# histogram of TotalInquiries
range(pld.ana$TotalInquiries, na.rm = TRUE)
## [1]   0 379
plt_TotalInquiries <- ggplot(aes(TotalInquiries), data = subset(pld.ana, !is.na(TotalInquiries))) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(limits = c(0, 60), breaks = seq(0, 60, 10))
plt_TotalInquiries

The TotalInquiries variable is clearly positively skewed.
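
The visual impression of skew can be backed by a number. Below is a minimal sketch of the moment-based sample skewness, written in base R; the helper `sample_skewness` and the vector `inquiries` are hypothetical and not part of the original analysis.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up counts with a long right tail and one missing value.
inquiries <- c(0, 1, 1, 2, 2, 3, 4, 5, 9, 40, NA)

skew <- sample_skewness(inquiries)
print(skew)
```

A value above 0 indicates a longer right tail; a symmetric sample gives 0.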

summary(pld.ana$TotalInquiries)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   4.000   5.584   7.000 379.000    1159
# histogram of DelinquenciesLast7Years
range(pld.ana$DelinquenciesLast7Years, na.rm = TRUE)
## [1]  0 99
plt_DelinquenciesLast7Years <- ggplot(aes(DelinquenciesLast7Years), data = subset(pld.ana, !is.na(DelinquenciesLast7Years))) +
  geom_histogram(binwidth = 0.5) 
plt_DelinquenciesLast7Years

The majority of the delinquency counts are 0, which means most borrowers had no recorded delinquency in the last 7 years.

summary(pld.ana$DelinquenciesLast7Years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.155   3.000  99.000     990
# histogram of TotalTrades
range(pld.ana$TotalTrades, na.rm = TRUE)
## [1]   0 126
plt_TotalTrades <- ggplot(aes(TotalTrades), data = subset(pld.ana, !is.na(TotalTrades))) +
  geom_histogram(binwidth = 0.5)
plt_TotalTrades

The TotalTrades distribution is similar in shape to that of TotalCreditLinespast7years, which suggests the two variables are closely related.
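
Similar histograms do not by themselves show that two variables move together, so a correlation is a useful complement. The sketch below uses only the first ten rows printed by `str(pld.ana)` earlier, which is far too few to draw a conclusion; on the full data the same call would be applied to the two columns of `pld.ana`.

```r
# First ten values of each column as shown in the str() output.
credit_lines <- c(12, 29, 3, 29, 49, 49, 20, 10, 32, 32)
trades       <- c(11, 29, NA, 26, 39, 47, 16, 10, 29, 29)

# Pearson correlation on the rows where both values are present.
r <- cor(credit_lines, trades, use = "complete.obs")
print(r)
```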

summary(pld.ana$TotalTrades)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   15.00   22.00   23.23   30.00  126.00    7544
# histogram of DebtToIncomeRatio
range(pld.ana$DebtToIncomeRatio, na.rm = TRUE)
## [1]  0.00 10.01
plt_DebtToIncomeRatio <- ggplot(aes(DebtToIncomeRatio), data = subset(pld.ana, !is.na(DebtToIncomeRatio))) +
  geom_histogram(binwidth = 0.01)+
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2))
plt_DebtToIncomeRatio

summary(pld.ana$DebtToIncomeRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

The median here is 0.22, a little less than the mean of 0.276; the mean is pulled up by a small number of large ratios (the maximum is 10.01).
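
Because the histogram restricts the x axis with `scale_x_continuous(limits = c(0, 1))`, values outside that interval are removed from the plot rather than merely hidden. A quick sketch for counting how many observations that affects, using a made-up vector `ratio`:

```r
# Made-up debt-to-income ratios, including two values above 1 and one NA.
ratio <- c(0.17, 0.18, 0.06, 0.15, 0.26, 0.36, 1.2, 10.01, NA)

# Work on the observed values only.
observed <- ratio[!is.na(ratio)]

# Count and share of values that fall outside the plotted limits [0, 1].
outside <- sum(observed < 0 | observed > 1)
share_outside <- outside / length(observed)

print(outside)
print(share_outside)
```

If the goal is only to zoom, `coord_cartesian(xlim = ...)`, which the later plots in this report use, keeps all observations in the binning.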

# histogram of TotalProsperPaymentsBilled
range(pld.ana$TotalProsperPaymentsBilled, na.rm = TRUE)
## [1]   0 141
plt_TotalProsperPaymentsBilled <- ggplot(aes(TotalProsperPaymentsBilled), data = subset(pld.ana, !is.na(TotalProsperPaymentsBilled))) +
  geom_histogram(binwidth = 0.5) +
  scale_x_continuous(breaks = seq(0, 141, 25))
plt_TotalProsperPaymentsBilled

The plot shows that the most common numbers of on-time payments the borrower made on Prosper loans are 6, 9 and 35.

summary(pld.ana$TotalProsperPaymentsBilled)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   16.00   22.93   33.00  141.00   91852
# histogram of ProsperPrincipalBorrowed
range(pld.ana$ProsperPrincipalBorrowed, na.rm = TRUE)
## [1]     0 72499
plt_ProsperPrincipalBorrowed <- ggplot(aes(log(ProsperPrincipalBorrowed+1)), data = subset(pld.ana, !is.na(ProsperPrincipalBorrowed))) +
  geom_histogram(binwidth = 0.1) +
  xlim(6,11)

plt_ProsperPrincipalBorrowed 

The mode of the distribution is around 3000.
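
Since the histogram is drawn on a `log(x + 1)` axis, a position read off the axis has to be transformed back to get a dollar amount. A minimal sketch, assuming the peak sits near 8 on the log axis (that reading is an assumption, not a value taken from the plot):

```r
# Position read off the log(x + 1) axis (assumed for illustration).
log_axis_value <- 8

# expm1() is the inverse of log1p(), i.e. of log(x + 1).
amount <- expm1(log_axis_value)
print(amount)

# Round trip: transforming an amount forward and back returns the amount.
round_trip <- expm1(log1p(3000))
print(round_trip)
```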

summary(pld.ana$ProsperPrincipalBorrowed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3500    6000    8472   11000   72500   91852
# histogram of ProsperPrincipalOutstanding
range(pld.ana$ProsperPrincipalOutstanding, na.rm = TRUE)
## [1]     0.00 23450.95
plt_ProsperPrincipalOutstanding <- ggplot(aes(log(ProsperPrincipalOutstanding+1)), data = subset(pld.ana, !is.na(ProsperPrincipalOutstanding))) +
  geom_histogram() 

plt_ProsperPrincipalOutstanding

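The most common outstanding principal is 0: at least a quarter of the non-missing values are 0 (the first quartile is 0), while the median is 1627.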

summary(pld.ana$ProsperPrincipalOutstanding)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0    1627    2930    4127   23450   91852
# histogram of LoanOriginalAmount
range(pld.ana$LoanOriginalAmount, na.rm = TRUE)
## [1]  1000 35000
plt_LoanOriginalAmount <- ggplot(aes(log(LoanOriginalAmount+1)), data = subset(pld.ana, !is.na(LoanOriginalAmount))) +
  geom_histogram() 

plt_LoanOriginalAmount

The values are distributed very unevenly, with large counts at 4000, 10000 and 15000.
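
Spikes at round amounts are easier to confirm from a frequency table than from a log-scaled histogram. A short sketch on made-up amounts (the vector `amounts` is hypothetical):

```r
# Made-up loan amounts with repeated round values.
amounts <- c(4000, 10000, 4000, 15000, 10000, 4000, 2500, 10000, 4000, 15000)

# Count each distinct amount and keep the three most frequent.
top <- head(sort(table(amounts), decreasing = TRUE), 3)
print(top)
```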

summary(pld.ana$LoanOriginalAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000
# histogram of MonthlyLoanPayment
range(pld.ana$MonthlyLoanPayment, na.rm = TRUE)
## [1]    0.00 2251.51
plt_MonthlyLoanPayment <- ggplot(aes(log(MonthlyLoanPayment+1)), data = subset(pld.ana, !is.na(MonthlyLoanPayment))) +
  geom_histogram(binwidth = .1) 

plt_MonthlyLoanPayment

Basically, every bin count is under 3500, except for the bin at around 140, whose count exceeds 8000.

summary(pld.ana$MonthlyLoanPayment)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0
# histograms with median line
EstimatedReturnMedian <- median(pld.ana$EstimatedReturn, na.rm = TRUE)
TotalCreditLinespast7yearsMedian <- median(pld.ana$TotalCreditLinespast7years, na.rm = TRUE)
TotalInquiriesMedian <- median(pld.ana$TotalInquiries, na.rm = TRUE)
DelinquenciesLast7YearsMedian <- median(pld.ana$DelinquenciesLast7Years, na.rm = TRUE)
TotalTradesMedian <- median(pld.ana$TotalTrades, na.rm = TRUE)
DebtToIncomeRatioMedian <- median(pld.ana$DebtToIncomeRatio, na.rm = TRUE)
TotalProsperPaymentsBilledMedian <- median(pld.ana$TotalProsperPaymentsBilled, na.rm = TRUE)
ProsperPrincipalBorrowedMedian <- median(pld.ana$ProsperPrincipalBorrowed, na.rm = TRUE)
ProsperPrincipalOutstandingMedian <- median(pld.ana$ProsperPrincipalOutstanding, na.rm = TRUE)
LoanOriginalAmountMedian <- median(pld.ana$LoanOriginalAmount, na.rm = TRUE)
MonthlyLoanPaymentMedian <- median(pld.ana$MonthlyLoanPayment, na.rm = TRUE)

plt_EstimatedReturn_MedianLine <- plt_EstimatedReturn + geom_vline(aes(xintercept = EstimatedReturnMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 0.1, y = 2250, label = paste("median: ", EstimatedReturnMedian), colour = "red")

plt_TotalCreditLinespast7years_MedianLine <- plt_TotalCreditLinespast7years + geom_vline(aes(xintercept = TotalCreditLinespast7yearsMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 30, y = 3500, label = paste("median: ", TotalCreditLinespast7yearsMedian), colour = "red")

plt_TotalInquiries_MedianLine <- plt_TotalInquiries + geom_vline(aes(xintercept = TotalInquiriesMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 10, y = 15000, label = paste("median: ", TotalInquiriesMedian), colour = "red")

plt_DelinquenciesLast7Years_MedianLine <- plt_DelinquenciesLast7Years + geom_vline(aes(xintercept = DelinquenciesLast7YearsMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 13, y = 70000, label = paste("median: ", DelinquenciesLast7YearsMedian), colour = "red")

plt_TotalTrades_MedianLine <- plt_TotalTrades + geom_vline(aes(xintercept = TotalTradesMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 35, y = 3500, label = paste("median: ", TotalTradesMedian), colour = "red")

plt_DebtToIncomeRatio_MedianLine <-plt_DebtToIncomeRatio + geom_vline(aes(xintercept = DebtToIncomeRatioMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 0.4, y = 3500, label = paste("median: ", DebtToIncomeRatioMedian), colour = "red")

plt_TotalProsperPaymentsBilled_MedianLine <- plt_TotalProsperPaymentsBilled + geom_vline(aes(xintercept = TotalProsperPaymentsBilledMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 35, y = 1500, label = paste("median: ", TotalProsperPaymentsBilledMedian), colour = "red")

plt_ProsperPrincipalBorrowed_MedianLine <- ggplot(aes(ProsperPrincipalBorrowed), data = subset(pld.ana, !is.na(ProsperPrincipalBorrowed))) +
  geom_histogram(binwidth = 500)+ 
  scale_x_continuous(limits = c(0, 70000), breaks = seq(0, 70000, 10000)) +
  geom_vline(aes(xintercept = ProsperPrincipalBorrowedMedian), col = "royalblue", lwd = 1) + 
  annotate("text", x = 10000, y = 2000, label = paste("median: ", ProsperPrincipalBorrowedMedian), colour = "red")

plt_ProsperPrincipalOutstanding_MedianLine <- ggplot(aes(ProsperPrincipalOutstanding), data = subset(pld.ana, !is.na(ProsperPrincipalOutstanding))) +
  geom_histogram(binwidth = 500)  + 
  scale_x_continuous(limits = c(0, 20000), breaks = seq(0, 20000,2000)) +
  geom_vline(aes(xintercept = ProsperPrincipalOutstandingMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 4500, y = 7000, label = paste("median: ", ProsperPrincipalOutstandingMedian), colour = "red")

plt_LoanOriginalAmount_MedianLine <- ggplot(aes(LoanOriginalAmount), data = subset(pld.ana, !is.na(LoanOriginalAmount))) +
  geom_histogram(binwidth = 500)  + 
  scale_x_continuous(limits = c(1000, 35000), breaks = seq(1000, 35000, 5000)) +
  geom_vline(aes(xintercept = LoanOriginalAmountMedian), col = "royalblue", lwd = 1) +
  annotate("text", x = 9000, y = 10000, label = paste("median: ", LoanOriginalAmountMedian), colour = "red")

plt_MonthlyLoanPayment_MedianLine <- ggplot(aes(MonthlyLoanPayment), data = subset(pld.ana, !is.na(MonthlyLoanPayment))) +
  geom_histogram(binwidth = 50)  +
  scale_x_continuous(limits = c(0, 2000), breaks = seq(0, 2000, 400)) +
  geom_vline(aes(xintercept = MonthlyLoanPaymentMedian), col = "royalblue", lwd = 1 ) +
  annotate("text", x = 380, y = 12000, label = paste("median: ", MonthlyLoanPaymentMedian), colour = "red")

grid.arrange(plt_EstimatedReturn_MedianLine, plt_TotalCreditLinespast7years_MedianLine, plt_TotalInquiries_MedianLine, plt_DelinquenciesLast7Years_MedianLine,
             plt_TotalTrades_MedianLine, plt_DebtToIncomeRatio_MedianLine, plt_TotalProsperPaymentsBilled_MedianLine,
             plt_ProsperPrincipalBorrowed_MedianLine, plt_ProsperPrincipalOutstanding_MedianLine, plt_LoanOriginalAmount_MedianLine, plt_MonthlyLoanPayment_MedianLine, ncol = 4)
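
The eleven separate `median()` calls above could be collapsed into a single pass over the numeric columns. The sketch below shows the idea on a toy data frame; `loans`, `numeric_cols` and `medians` are hypothetical names, and the values are made up.

```r
# Toy data frame with two numeric columns and one non-numeric column.
loans <- data.frame(
  TotalTrades        = c(11, 29, NA, 26, 39),
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000),
  Occupation         = c("Other", "Teacher", "Other", "Executive", "Other")
)

# Names of the numeric columns.
numeric_cols <- names(loans)[vapply(loans, is.numeric, logical(1))]

# Median of every numeric column, ignoring NAs.
medians <- vapply(loans[numeric_cols], median, numeric(1), na.rm = TRUE)
print(medians)

# Text for a plot annotation, built from the named vector.
label <- paste("median:", medians[["LoanOriginalAmount"]])
print(label)
```

Each plot could then look up its value by column name, for example `medians[["TotalTrades"]]`, instead of relying on a separately named variable.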

par(mfrow = c(3, 2), mar = c(4, 13, 2, 2))
# bar chart of IncomeRange
barplot(table(pld.ana$IncomeRange), horiz = TRUE, las = 2)
title(main = "IncomeRange", cex.main = 2)

# bar chart of Occupation
barplot(table(pld.Occupation$Occupation), horiz = TRUE, las = 2)
title(main = "Occupations", cex.main = 2)

# bar chart of IsBorrowerHomeowner
barplot(table(pld.ana$IsBorrowerHomeowner), horiz = TRUE, las = 2)
title(main = "IsBorrowerHomeowner", cex.main = 2)

# bar chart of ProsperScore
barplot(table(pld.ana$ProsperScore), horiz = TRUE, las = 2)
title(main = "ProsperScore", cex.main = 2)

The IncomeRange categories $25,000-49,999 and $50,000-74,999 are the most frequent, with around 30000 observations each. The selected occupations are roughly evenly distributed, as are the two IsBorrowerHomeowner categories. ProsperScore values of 4, 6 and 8 are the most common.

pld.IncomeRange <- subset(pld.ana, IncomeRange %in% c("$1-24,999", "$25,000-49,999", "$50,000-74,999", "$75,000-99,999", "$100,000+"))
# EstimatedReturn by IncomeRange
ggplot(aes(EstimatedReturn), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange), binwidth = 0.005) +
  xlim(0, 0.25)

by(pld.ana$EstimatedReturn, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.0166  0.1132  0.1243  0.1157  0.1360  0.1698     576 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.1656  0.0878  0.1123  0.1092  0.1271  0.2265    2620 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.1827  0.0672  0.0827  0.0879  0.1074  0.2667    2132 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.182   0.080   0.101   0.102   0.124   0.257    8017 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.168   0.074   0.090   0.095   0.114   0.284    5423 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.0868  0.0718  0.0872  0.0920  0.1112  0.2570    2418 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    7741 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -0.0105  0.1110  0.1221  0.1194  0.1358  0.2265     157

Among the income ranges, $25,000-49,999 and $50,000-74,999 have the highest counts; $75,000-99,999 and $100,000+ come next. From the plot, we can see that the blue line ($50,000-74,999) and the green line ($25,000-49,999) have more counts than the other ranges, but they follow different trends once the return exceeds about 0.08: from there on, the green line has more counts than the blue line.
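
Raw counts in a frequency polygon mix two effects: how large each income group is and how its returns are distributed. Converting counts to within-group shares separates the two. The sketch below uses made-up values and hypothetical group labels "A" and "B"; the break at 0.08 only echoes the crossover mentioned above.

```r
# Made-up returns for a large group "A" and a small group "B".
returns <- c(0.06, 0.07, 0.09, 0.11, 0.12, 0.12, 0.07, 0.075)
income  <- c("A", "A", "A", "A", "A", "A", "B", "B")

# Two bins: returns up to 0.08 and returns above 0.08.
bins <- cut(returns, breaks = c(0, 0.08, 0.25))

# Counts per group and bin, then the share of each group in each bin.
counts <- table(income, bins)
shares <- prop.table(counts, margin = 1)

print(counts)
print(shares)
```

Here both groups have the same count in the lower bin, yet the shares differ sharply. In ggplot2 a comparable effect can usually be obtained by mapping `y = after_stat(density)` in `geom_freqpoly()`, assuming a version of ggplot2 that provides `after_stat()`.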

# TotalCreditLinespast7years by IncomeRange
ggplot(aes(TotalCreditLinespast7years), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange))

by(pld.ana$TotalCreditLinespast7years, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   11.00   21.00   22.97   31.00  101.00       3 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00    9.00   15.00   17.55   24.00  101.00       7 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   23.00   32.00   33.12   41.00  136.00       1 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   15.00   22.00   23.48   30.00  118.00       6 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   19.00   26.00   27.99   35.00  107.00       1 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   21.00   29.00   30.48   38.00  124.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   12.00   20.00   22.24   29.00  127.00     677 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   11.00   18.00   20.26   27.00   71.00       2

The green line ($25,000-49,999) has two modes for total credit lines, at around 13 and 25 respectively; the latter is also the mode of the other ranges, except for the red line ($1-24,999).

# TotalInquiries by IncomeRange
ggplot(aes(TotalInquiries), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange)) +
  xlim(0, 10)

by(pld.ana$TotalInquiries, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   7.836  10.000  78.000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   4.166   5.000 112.000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   6.045   8.000  90.000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   4.845   6.000 117.000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   5.318   7.000 109.000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   5.567   7.000 158.000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    4.00    8.00   10.95   14.00  379.00    1159 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   3.087   4.000  29.000

In this plot, all the ranges follow similar trends.
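
The `by()` summaries above list the income ranges in the factor's default level order, which places $100,000+ between $1-24,999 and $25,000-49,999. Setting the levels explicitly makes grouped output and plot legends follow the natural order. A minimal sketch on a made-up vector `income`:

```r
# Made-up income ranges; the levels default to sorted character order.
income <- factor(c("$100,000+", "$1-24,999", "$25,000-49,999", "$100,000+"))

# Desired order, from lowest to highest range.
ordered_levels <- c("$1-24,999", "$25,000-49,999", "$100,000+")

# Rebuild the factor with the levels in that order.
income <- factor(income, levels = ordered_levels)

print(levels(income))
print(table(income))
```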

# TotalTrades by IncomeRange
ggplot(aes(TotalTrades), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange), binwidth = 5) +
  coord_cartesian(xlim = c(0, 100))

by(pld.ana$TotalTrades, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   17.00   19.03   26.00   86.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   14.27   19.00   83.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   20.00   28.00   29.35   37.00  126.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   12.00   18.00   19.73   26.00   91.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   16.00   23.00   24.03   30.00  103.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   25.00   26.51   33.00  122.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   11.00   18.00   19.45   26.75   65.00    7543 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    8.00   14.00   16.26   21.00   65.00       1

The mode of the green line ($25,000-49,999) lies at fewer trades than the modes of the other ranges, except for the red line ($1-24,999).
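
Modes read off overlapping lines are hard to compare, so it can help to compute the most populated bin for each group directly. The helper `modal_bin` below is a hypothetical sketch, with the bin width of 5 matching the `binwidth` used in the plot; the data and the group labels "low" and "high" are made up.

```r
# Start of the most populated bin of the given width, ignoring NAs.
modal_bin <- function(x, width) {
  x <- x[!is.na(x)]
  bin_start <- floor(x / width) * width
  counts <- table(bin_start)
  as.numeric(names(counts)[which.max(counts)])
}

# Made-up trade counts for two income groups.
trades <- c(11, 12, 13, 22, 8, 26, 27, 28, 16, NA)
income <- c(rep("low", 5), rep("high", 5))

# Modal bin per group.
modes <- tapply(trades, income, modal_bin, width = 5)
print(modes)
```

Note that these bins start at multiples of the width, which need not coincide with the bin boundaries ggplot2 chooses, so the result is only an approximation of what the plot shows.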

# DebtToIncomeRatio by IncomeRange
ggplot(aes(DebtToIncomeRatio), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange), binwidth = 0.2) +
  coord_cartesian(xlim = c(0, 1.5))

by(pld.ana$DebtToIncomeRatio, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA     621 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.190   0.320   0.737   0.500  10.010     913 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1200  0.1700  0.1806  0.2300 10.0100    1266 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1700  0.2600  0.2789  0.3600  7.9000    2311 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1600  0.2300  0.2457  0.3200 10.0100    1690 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1400  0.2000  0.2137  0.2800  2.5500     901 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.090   0.160   0.297   0.260  10.010     124 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.010   0.160   0.295   3.328  10.010  10.010     728

All the ranges have the same mode, but the green line ($25,000-49,999) decreases more slowly than the blue line ($50,000-74,999).

# TotalProsperPaymentsBilled by IncomeRange
ggplot(aes(TotalProsperPaymentsBilled), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange), binwidth = 5) +
  coord_cartesian(xlim = c(0, 100))

by(pld.ana$TotalProsperPaymentsBilled, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   14.00   18.39   21.50   81.00     542 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   16.00   22.29   32.00  116.00    6029 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   16.00   22.92   33.00  133.00   13335 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     9.0    16.0    22.3    33.0   131.0   25808 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   16.00   23.43   34.00  141.00   24521 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   16.00   23.63   34.00  128.00   13198 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    7741 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.00   13.00   18.41   27.00   70.00     678

All ranges here have similar trends.

# ProsperPrincipalBorrowed by IncomeRange
ggplot(aes(ProsperPrincipalBorrowed), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange), binwidth = 1000) +
  coord_cartesian(xlim = c(0, 40000))

by(pld.ana$ProsperPrincipalBorrowed, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    2512    5000    8374   10000   40000     542 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    2000    4000    5147    6400   37000    6029 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    5000   10000   12230   15500   72500   13335 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3000    5000    6449    8000   67000   25808 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    3500    6000    8094   10400   65000   24521 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    4000    7500    9791   13500   55900   13198 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    7741 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1000    2475    4000    5185    6525   23600     678

The green line ($25,000-49,999) behaves differently from the other ranges: once the principal borrowed exceeds 20,000, every other range flattens out, while the green line stays noisy.

# ProsperPrincipalOutstanding by IncomeRange
ggplot(aes(ProsperPrincipalOutstanding), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange)) +
  coord_cartesian(xlim = c(0, 15000))

by(pld.ana$ProsperPrincipalOutstanding, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0   941.8  2453.0  4068.0 13280.0     542 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0    1005    1852    2767   20320    6029 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0    2454    4083    6445   23260   13335 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0    1478    2375    3405   21590   25808 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0    1659    2794    3956   23030   24521 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0    1882    3280    5096   23450   13198 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      NA      NA      NA     NaN      NA      NA    7741 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0    1147    2149    3501   13740     678
# LoanOriginalAmount by IncomeRange
ggplot(aes(LoanOriginalAmount), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange))

by(pld.ana$LoanOriginalAmount, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    5000    7411   10000   25000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2052    4000    4274    5000   25000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    6000   12000   13070   18500   35000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    5000    6178    9800   25000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    7500    8675   13500   25000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    9700   10370   15000   25000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2100    3033    5170    6001   25000 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    2500    4000    4885    6000   25000
# MonthlyLoanPayment by IncomeRange
ggplot(aes(MonthlyLoanPayment), data = pld.IncomeRange) +
  geom_freqpoly(aes(color = IncomeRange)) +
  coord_cartesian(xlim = c(0, 1300))

by(pld.ana$MonthlyLoanPayment, pld.ana$IncomeRange, summary)
## pld.ana$IncomeRange: $0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   87.14  169.70  267.50  347.60 1131.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $1-24,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   86.38  134.30  154.70  173.70 1048.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $100,000+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   208.6   375.0   412.2   560.1  2252.0 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $25,000-49,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   118.9   173.7   210.4   282.8  1382.0 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $50,000-74,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   155.8   253.1   280.3   383.7  1778.0 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: $75,000-99,999
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   169.5   301.6   329.4   457.2  2112.0 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not displayed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   76.62  122.50  182.40  217.80 1048.00 
## -------------------------------------------------------- 
## pld.ana$IncomeRange: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   90.28  169.60  183.80  217.70 1086.00

The green line ($25,000-49,999) drops off rapidly after its mode. Across the plots above, this range has a noticeably different distribution from the others, so I will tentatively assume that the $25,000-49,999 income range has more influence than the other ranges.
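Frequency polygons of raw counts are dominated by group size, so if the income ranges contain very different numbers of loans, part of the visual difference may simply reflect that. A minimal base-R sketch of one way to compare shapes instead of counts, using within-group shares per bin. The toy values, the group labels "A" and "B", and the bin edges are made up for illustration and are not taken from the Prosper file:

```r
# Toy data (hypothetical): group A has 8 loans, group B only 2.
toy <- data.frame(
  IncomeRange = c(rep("A", 8), rep("B", 2)),
  MonthlyLoanPayment = c(50, 60, 150, 160, 170, 250, 260, 350, 150, 250),
  stringsAsFactors = FALSE
)

# Bin the payments, then count loans per group and bin.
bins <- cut(toy$MonthlyLoanPayment, breaks = c(0, 100, 200, 300, 400))
counts <- table(toy$IncomeRange, bins)

# Convert counts to within-group shares so each row sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

In ggplot2 a similar effect can probably be had by mapping y to the density computed by the binning stat, rather than to the default count.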

# EstimatedReturn by Occupation
bplt_EstimatedReturn_by_Occupation <- ggplot(aes(x = Occupation, y = EstimatedReturn), data = subset(pld.Occupation, !is.na(EstimatedReturn))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0.05, 0.15))
bplt_EstimatedReturn_by_Occupation

pld.Occupation.EstimatedReturn_by_Occupation <- subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(Occupation)) %>%
  group_by(Occupation) %>%
  summarise(mean_EstimatedReturn = mean(EstimatedReturn),
            median_EstimatedReturn = median(EstimatedReturn),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.EstimatedReturn_by_Occupation 
## # A tibble: 6 x 4
##                 Occupation mean_EstimatedReturn median_EstimatedReturn
##                     <fctr>                <dbl>                  <dbl>
## 1           Accountant/CPA           0.09380908                0.08922
## 2 Administrative Assistant           0.10391261                0.10423
## 3      Computer Programmer           0.08935118                0.08360
## 4                Executive           0.09055953                0.08529
## 5       Sales - Commission           0.09614054                0.09130
## 6                  Teacher           0.09582725                0.09130
## # ... with 1 more variables: n <int>

Administrative Assistant has the highest mean (0.104) and median (0.104) estimated return; the other occupations are slightly lower, with means between 0.089 and 0.096.
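The tibble printout above hides the n column ("with 1 more variables"). A minimal sketch, assuming only base R, of building the same mean/median/n summary as a plain data.frame, which prints every column. The toy rows are hypothetical and stand in for pld.Occupation:

```r
# Toy stand-in for pld.Occupation (hypothetical values).
toy <- data.frame(
  Occupation = c("Teacher", "Teacher", "Executive", "Executive", "Executive"),
  EstimatedReturn = c(0.09, 0.11, 0.08, 0.10, NA),
  stringsAsFactors = FALSE
)

# Drop rows with a missing return, as the subset() call above does.
complete <- toy[!is.na(toy$EstimatedReturn), ]

# One summary row per occupation, bound into a single data.frame.
summary_tbl <- do.call(rbind, lapply(split(complete, complete$Occupation), function(g) {
  data.frame(
    Occupation = g$Occupation[1],
    mean_EstimatedReturn = mean(g$EstimatedReturn),
    median_EstimatedReturn = median(g$EstimatedReturn),
    n = nrow(g),
    stringsAsFactors = FALSE
  )
}))
print(summary_tbl)
```

With the dplyr result itself, converting it with as.data.frame() before printing should have the same effect.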

# EstimatedReturn by ProsperScore
bplt_EstimatedReturn_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = EstimatedReturn), data = subset(pld.ana, !is.na(EstimatedReturn) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0.025, 0.15))
bplt_EstimatedReturn_by_ProsperScore

pld.EstimatedReturn_by_ProsperScore <- subset(pld.ana, !is.na(EstimatedReturn) & !is.na(ProsperScore)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_EstimatedReturn = mean(EstimatedReturn),
            median_EstimatedReturn = median(EstimatedReturn),
            n = n()) %>%
  arrange(ProsperScore)
pld.EstimatedReturn_by_ProsperScore
## # A tibble: 11 x 4
##    ProsperScore mean_EstimatedReturn median_EstimatedReturn     n
##          <fctr>                <dbl>                  <dbl> <int>
## 1             1           0.10513663               0.124600   992
## 2             2           0.10956012               0.110700  5766
## 3             3           0.10455803               0.104065  7642
## 4             4           0.10111778               0.096090 12595
## 5             5           0.10792146               0.108700  9813
## 6             6           0.10490942               0.100100 12278
## 7             7           0.09905789               0.089220 10597
## 8             8           0.08703009               0.078240 12053
## 9             9           0.07571866               0.070100  6911
## 10           10           0.06149748               0.057820  4750
## 11           11           0.05621365               0.053420  1456

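In general, the estimated return decreases as the Prosper score rises: the mean stays between roughly 0.10 and 0.11 for scores 1 to 6, then falls steadily from 0.099 at score 7 to 0.056 at score 11.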
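One rough way to quantify "generally decreasing" is a Spearman rank correlation between the score and the group means printed above. This is only a sketch: it uses the eleven group means rather than the individual loans, so it describes the trend of the averages, not the strength of the relationship in the raw data.

```r
# Group means copied from the EstimatedReturn-by-ProsperScore table above.
score <- 1:11
mean_return <- c(0.10513663, 0.10956012, 0.10455803, 0.10111778, 0.10792146,
                 0.10490942, 0.09905789, 0.08703009, 0.07571866, 0.06149748,
                 0.05621365)

# Spearman's rho compares rank orders; values near -1 indicate a
# consistently decreasing relationship, values near 0 no monotonic trend.
rho <- cor(score, mean_return, method = "spearman")
print(rho)
```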

# TotalCreditLinespast7years by Occupation
bplt_TotalCreditLinepast7years_by_Occupation <- ggplot(aes(x = Occupation, y = TotalCreditLinespast7years), data = subset(pld.Occupation, !is.na(TotalCreditLinespast7years))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(10, 45))
bplt_TotalCreditLinepast7years_by_Occupation

pld.Occupation.TotalCreditLinespast7years_by_Occupation <- subset(pld.Occupation, !is.na(TotalCreditLinespast7years)) %>%
  group_by(Occupation) %>%
  summarise(mean_TotalCreditLinespast7years = mean(TotalCreditLinespast7years),
            median_TotalCreditLinespast7years = median(TotalCreditLinespast7years),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.TotalCreditLinespast7years_by_Occupation 
## # A tibble: 6 x 4
##                 Occupation mean_TotalCreditLinespast7years
##                     <fctr>                           <dbl>
## 1           Accountant/CPA                        30.43007
## 2 Administrative Assistant                        26.00000
## 3      Computer Programmer                        25.51665
## 4                Executive                        31.55731
## 5       Sales - Commission                        26.57342
## 6                  Teacher                        30.90205
## # ... with 2 more variables: median_TotalCreditLinespast7years <dbl>,
## #   n <int>

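Judging by the means printed above, Executive (31.56) and Teacher (30.90) have the most credit lines, with Accountant/CPA close behind (30.43); Computer Programmer (25.52) and Administrative Assistant (26.00) have the fewest. The median column is cut off in this printout.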

# TotalCreditLinespast7years by ProsperScore
bplt_TotalCreditLinepast7years_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalCreditLinespast7years), data = subset(pld.ana, !is.na(TotalCreditLinespast7years) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(10, 45))
bplt_TotalCreditLinepast7years_by_ProsperScore

pld.TotalCreditLinespast7years_by_ProsperScore <- subset(pld.ana, !is.na(TotalCreditLinespast7years)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_TotalCreditLinespast7years = mean(TotalCreditLinespast7years),
            median_TotalCreditLinespast7years = median(TotalCreditLinespast7years),
            n = n()) %>%
  arrange(ProsperScore)
pld.TotalCreditLinespast7years_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_TotalCreditLinespast7years
##          <fctr>                           <dbl>
## 1             1                        34.44052
## 2             2                        28.73014
## 3             3                        28.58872
## 4             4                        27.85907
## 5             5                        27.61796
## 6             6                        27.04781
## 7             7                        27.02850
## 8             8                        27.08322
## 9             9                        26.74591
## 10           10                        27.97663
## 11           11                        30.14629
## 12           NA                        24.05731
## # ... with 2 more variables: median_TotalCreditLinespast7years <dbl>,
## #   n <int>

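As the Prosper score rises, the boxplots trace a roughly U-shaped (quadratic-looking) curve: the number of credit lines decreases first and then increases. The mean falls from 34.4 at score 1 to 26.7 at score 9, then climbs back to 30.1 at score 11.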
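The "decreasing, then increasing" shape can be checked with a quick quadratic fit to the group means printed above. This is a sketch on eleven averages, not a model of the underlying loans; a positive coefficient on the squared term means the fitted curve opens upward.

```r
# Group means copied from the TotalCreditLinespast7years-by-ProsperScore
# table above (scores 1 to 11; the NA group is left out).
score <- 1:11
mean_lines <- c(34.44052, 28.73014, 28.58872, 27.85907, 27.61796, 27.04781,
                27.02850, 27.08322, 26.74591, 27.97663, 30.14629)

# Ordinary least squares fit of mean ~ a + b * score + c * score^2.
fit <- lm(mean_lines ~ score + I(score^2))
print(coef(fit))

# c > 0 means a U shape; the minimum of the parabola sits at -b / (2c).
quad <- coef(fit)[["I(score^2)"]]
turning_point <- -coef(fit)[["score"]] / (2 * quad)
print(turning_point)
```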

# TotalInquiries by Occupation
bplt_TotalInquiries_by_Occupation <- ggplot(aes(x = Occupation, y = TotalInquiries), data = subset(pld.Occupation, !is.na(TotalInquiries))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 10))
bplt_TotalInquiries_by_Occupation

pld.Occupation.TotalInquiries_by_Occupation <- subset(pld.Occupation, !is.na(TotalInquiries)) %>%
  group_by(Occupation) %>%
  summarise(mean_TotalInquiries = mean(TotalInquiries),
            median_TotalInquiries = median(TotalInquiries),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.TotalInquiries_by_Occupation
## # A tibble: 6 x 4
##                 Occupation mean_TotalInquiries median_TotalInquiries     n
##                     <fctr>               <dbl>                 <dbl> <int>
## 1           Accountant/CPA            5.617074                     4  3233
## 2 Administrative Assistant            5.460266                     4  3687
## 3      Computer Programmer            5.674408                     4  4478
## 4                Executive            6.264208                     5  4311
## 5       Sales - Commission            6.499420                     4  3446
## 6                  Teacher            5.166002                     3  3759

Executive has the highest median (5) and Teacher the lowest (3); the remaining occupations all have a median of 4. Sales - Commission has the highest mean (6.50).

# TotalInquiries by ProsperScore
bplt_TotalInquiries_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalInquiries), data = subset(pld.ana, !is.na(TotalInquiries) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 15))
bplt_TotalInquiries_by_ProsperScore

pld.TotalInquiries_by_ProsperScore <- subset(pld.ana, !is.na(TotalInquiries)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_TotalInquiries = mean(TotalInquiries),
            median_TotalInquiries = median(TotalInquiries),
            n = n()) %>%
  arrange(ProsperScore)
pld.TotalInquiries_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_TotalInquiries median_TotalInquiries     n
##          <fctr>               <dbl>                 <dbl> <int>
## 1             1           10.718750                     9   992
## 2             2            5.834374                     5  5766
## 3             3            5.273619                     4  7642
## 4             4            4.802541                     4 12595
## 5             5            4.301743                     3  9813
## 6             6            3.990878                     3 12278
## 7             7            3.823535                     3 10597
## 8             8            3.482370                     3 12053
## 9             9            3.479959                     3  6911
## 10           10            3.302526                     3  4750
## 11           11            3.811126                     3  1456
## 12           NA            9.516383                     7 27925

Total inquiries decrease as the score increases. The largest gap is between scores 1 and 2 (mean 10.7 versus 5.8); from score 2 to 11 the differences between adjacent scores are much smaller.

# DelinquenciesLast7Years by Occupation
bplt_DelinquenciesLast7Years_by_Occupation <- ggplot(aes(x = Occupation, y = DelinquenciesLast7Years), data = subset(pld.Occupation, !is.na(DelinquenciesLast7Years))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 5))
bplt_DelinquenciesLast7Years_by_Occupation

by(pld.Occupation$DelinquenciesLast7Years, pld.Occupation$Occupation, summary)
## pld.Occupation$Occupation: Accountant/CPA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   3.955   2.000  99.000       1 
## -------------------------------------------------------- 
## pld.Occupation$Occupation: Administrative Assistant
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.748   4.000  99.000       7 
## -------------------------------------------------------- 
## pld.Occupation$Occupation: Computer Programmer
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.831   0.000  99.000       3 
## -------------------------------------------------------- 
## pld.Occupation$Occupation: Executive
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   3.553   1.000  99.000       1 
## -------------------------------------------------------- 
## pld.Occupation$Occupation: Sales - Commission
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.792   4.000  99.000       7 
## -------------------------------------------------------- 
## pld.Occupation$Occupation: Teacher
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.798   4.000  99.000       2

Every occupation here has a median of 0. Computer Programmer has the narrowest spread (third quartile 0, mean 2.83), followed by Executive (third quartile 1), while Administrative Assistant, Sales - Commission and Teacher have the widest (third quartile 4).
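The spread comparison can be read straight off the printed quartiles. The summary output does not report a variance, so this sketch uses the interquartile range (third quartile minus first quartile) as a stand-in measure of spread:

```r
# First and third quartiles copied from the by() output above.
quartiles <- data.frame(
  Occupation = c("Accountant/CPA", "Administrative Assistant",
                 "Computer Programmer", "Executive",
                 "Sales - Commission", "Teacher"),
  q1 = c(0, 0, 0, 0, 0, 0),
  q3 = c(2, 4, 0, 1, 4, 4),
  stringsAsFactors = FALSE
)

# Interquartile range per occupation.
quartiles$iqr <- quartiles$q3 - quartiles$q1

# Occupation with the narrowest IQR, and all occupations tied for the widest.
narrowest <- quartiles$Occupation[which.min(quartiles$iqr)]
widest <- quartiles$Occupation[quartiles$iqr == max(quartiles$iqr)]
print(quartiles[order(quartiles$iqr), ])
```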

# DelinquenciesLast7Years by ProsperScore
bplt_DelinquenciesLast7Years_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = DelinquenciesLast7Years), data = subset(pld.ana, !is.na(DelinquenciesLast7Years) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 8))
bplt_DelinquenciesLast7Years_by_ProsperScore

by(pld.ana$DelinquenciesLast7Years, pld.ana$ProsperScore, summary)
## pld.ana$ProsperScore: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   6.846   8.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.996   5.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.576   4.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   4.255   3.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.934   3.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.807   2.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.702   2.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.995   1.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.327   0.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.587   0.000  99.000 
## -------------------------------------------------------- 
## pld.ana$ProsperScore: 11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.414   0.000  76.000

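As with the other variables, delinquencies shrink as the score gets higher: the median is 0 at every score, but the mean falls from 6.85 at score 1 to 1.41 at score 11, and the third quartile falls from 8 to 0.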

# TotalTrades by Occupation
bplt_TotalTrades_by_Occupation <- ggplot(aes(x = Occupation, y = TotalTrades), data = subset(pld.Occupation, !is.na(TotalTrades))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(10, 45))
bplt_TotalTrades_by_Occupation

pld.Occupation.TotalTrades_by_Occupation <- subset(pld.Occupation, !is.na(TotalTrades)) %>%
  group_by(Occupation) %>%
  summarise(mean_TotalTrades = mean(TotalTrades),
            median_TotalTrades = median(TotalTrades),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.TotalTrades_by_Occupation
## # A tibble: 6 x 4
##                 Occupation mean_TotalTrades median_TotalTrades     n
##                     <fctr>            <dbl>              <dbl> <int>
## 1           Accountant/CPA         25.84001                 25  3119
## 2 Administrative Assistant         22.46338                 21  3509
## 3      Computer Programmer         22.43504                 21  4218
## 4                Executive         28.07990                 27  4155
## 5       Sales - Commission         23.35712                 22  3195
## 6                  Teacher         25.07105                 24  3617

Executive has the highest median (27), with Administrative Assistant and Computer Programmer the lowest (both 21).

# TotalTrades by ProsperScore
bplt_TotalTrades_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalTrades), data = subset(pld.ana, !is.na(TotalTrades) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(12, 45))
bplt_TotalTrades_by_ProsperScore

pld.TotalTrades_by_ProsperScore <- subset(pld.ana, !is.na(TotalTrades)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_TotalTrades = mean(TotalTrades),
            median_TotalTrades = median(TotalTrades),
            n = n()) %>%
  arrange(ProsperScore)
pld.TotalTrades_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_TotalTrades median_TotalTrades     n
##          <fctr>            <dbl>              <dbl> <int>
## 1             1         29.39113                 28   992
## 2             2         24.39716                 23  5766
## 3             3         24.47265                 23  7642
## 4             4         23.89162                 23 12595
## 5             5         23.74330                 22  9813
## 6             6         23.29093                 22 12278
## 7             7         23.45485                 22 10597
## 8             8         23.59379                 22 12053
## 9             9         23.52192                 22  6911
## 10           10         25.06589                 24  4750
## 11           11         26.87981                 26  1456
## 12           NA         20.47827                 18 21540

As with the TotalCreditLinespast7years variable, the boxplots trace a roughly U-shaped curve as the Prosper score rises, decreasing first and then increasing, but the changes are relatively smaller.

# DebtToIncomeRatio by Occupation
bplt_DebtToIncomeRatio_by_Occupation <- ggplot(aes(x = Occupation, y = DebtToIncomeRatio), data = subset(pld.Occupation, !is.na(DebtToIncomeRatio))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 0.5))
bplt_DebtToIncomeRatio_by_Occupation

pld.Occupation.DebtToIncomeRatio_by_Occupation <- subset(pld.Occupation, !is.na(DebtToIncomeRatio)) %>%
  group_by(Occupation) %>%
  summarise(mean_DebtToIncomeRatio = mean(DebtToIncomeRatio),
            median_DebtToIncomeRatio = median(DebtToIncomeRatio),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.DebtToIncomeRatio_by_Occupation
## # A tibble: 6 x 4
##                 Occupation mean_DebtToIncomeRatio median_DebtToIncomeRatio
##                     <fctr>                  <dbl>                    <dbl>
## 1           Accountant/CPA              0.2445767                     0.22
## 2 Administrative Assistant              0.3018444                     0.25
## 3      Computer Programmer              0.2000212                     0.18
## 4                Executive              0.2114236                     0.18
## 5       Sales - Commission              0.2577339                     0.19
## 6                  Teacher              0.3046247                     0.26
## # ... with 1 more variables: n <int>

Administrative Assistant and Teacher have the highest medians (0.25 and 0.26), with Computer Programmer and Executive the lowest (both 0.18).

# DebtToIncomeRatio by ProsperScore
bplt_DebtToIncomeRatio_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = DebtToIncomeRatio), data = subset(pld.ana, !is.na(DebtToIncomeRatio) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0.05, 0.45))
bplt_DebtToIncomeRatio_by_ProsperScore

pld.DebtToIncomeRatio_by_ProsperScore <- subset(pld.ana, !is.na(DebtToIncomeRatio)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_DebtToIncomeRatio = mean(DebtToIncomeRatio),
            median_DebtToIncomeRatio = median(DebtToIncomeRatio),
            n = n()) %>%
  arrange(ProsperScore)
pld.DebtToIncomeRatio_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_DebtToIncomeRatio median_DebtToIncomeRatio     n
##          <fctr>                  <dbl>                    <dbl> <int>
## 1             1              0.4275173                     0.33   721
## 2             2              0.3178847                     0.27  4822
## 3             3              0.3212492                     0.28  6580
## 4             4              0.2945494                     0.27 11164
## 5             5              0.2896673                     0.25  8776
## 6             6              0.2681130                     0.24 11309
## 7             7              0.2382009                     0.21  9966
## 8             8              0.2156692                     0.19 11543
## 9             9              0.1939321                     0.17  6625
## 10           10              0.1765553                     0.16  4639
## 11           11              0.2006657                     0.19  1412
## 12           NA              0.3238720                     0.20 27826

DebtToIncomeRatio clearly decreases as the score gets higher: the median falls from 0.33 at score 1 to 0.16 at score 10, with a slight rise to 0.19 at score 11.

# TotalProsperPaymentsBilled by Occupation
bplt_TotalProsperPaymentsBilled_by_Occupation <- ggplot(aes(x = Occupation, y = TotalProsperPaymentsBilled), data = subset(pld.Occupation, !is.na(TotalProsperPaymentsBilled))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 45))
bplt_TotalProsperPaymentsBilled_by_Occupation

pld.Occupation.TotalProsperPaymentsBilled_by_Occupation <- subset(pld.Occupation, !is.na(TotalProsperPaymentsBilled)) %>%
  group_by(Occupation) %>%
  summarise(mean_TotalProsperPaymentsBilled = mean(TotalProsperPaymentsBilled),
            median_TotalProsperPaymentsBilled = median(TotalProsperPaymentsBilled),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.TotalProsperPaymentsBilled_by_Occupation 
## # A tibble: 6 x 4
##                 Occupation mean_TotalProsperPaymentsBilled
##                     <fctr>                           <dbl>
## 1           Accountant/CPA                        22.45644
## 2 Administrative Assistant                        23.30863
## 3      Computer Programmer                        23.50436
## 4                Executive                        21.88734
## 5       Sales - Commission                        20.98701
## 6                  Teacher                        25.18665
## # ... with 2 more variables: median_TotalProsperPaymentsBilled <dbl>,
## #   n <int>

All occupations have similar median values, and the means shown above range only from about 21.0 to 25.2.

# TotalProsperPaymentsBilled by ProsperScore
bplt_TotalProsperPaymentsBilled_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = TotalProsperPaymentsBilled), data = subset(pld.ana, !is.na(TotalProsperPaymentsBilled) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(5, 45))
bplt_TotalProsperPaymentsBilled_by_ProsperScore

pld.TotalProsperPaymentsBilled_by_ProsperScore <- subset(pld.ana, !is.na(TotalProsperPaymentsBilled)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_TotalProsperPaymentsBilled = mean(TotalProsperPaymentsBilled),
            median_TotalProsperPaymentsBilled = median(TotalProsperPaymentsBilled),
            n = n()) %>%
  arrange(ProsperScore)
pld.TotalProsperPaymentsBilled_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_TotalProsperPaymentsBilled
##          <fctr>                           <dbl>
## 1             1                        24.26814
## 2             2                        22.99799
## 3             3                        23.15956
## 4             4                        24.36944
## 5             5                        23.85575
## 6             6                        23.56653
## 7             7                        23.51542
## 8             8                        23.96136
## 9             9                        24.11320
## 10           10                        26.87716
## 11           11                        30.41354
## 12           NA                        11.08566
## # ... with 2 more variables: median_TotalProsperPaymentsBilled <dbl>,
## #   n <int>

The values differ little across scores 1 to 9 (means between about 23.0 and 24.4); only scores 10 and 11 are noticeably higher, at 26.9 and 30.4.

# ProsperPrincipalBorrowed by Occupation
bplt_ProsperPrincipalBorrowed_by_Occupation <- ggplot(aes(x = Occupation, y = ProsperPrincipalBorrowed), data = subset(pld.Occupation, !is.na(ProsperPrincipalBorrowed))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 15000))
bplt_ProsperPrincipalBorrowed_by_Occupation

pld.Occupation.ProsperPrincipalBorrowed_by_Occupation <- subset(pld.Occupation, !is.na(ProsperPrincipalBorrowed)) %>%
  group_by(Occupation) %>%
  summarise(mean_ProsperPrincipalBorrowed = mean(ProsperPrincipalBorrowed),
            median_ProsperPrincipalBorrowed = median(ProsperPrincipalBorrowed),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.ProsperPrincipalBorrowed_by_Occupation
## # A tibble: 6 x 4
##                 Occupation mean_ProsperPrincipalBorrowed
##                     <fctr>                         <dbl>
## 1           Accountant/CPA                      9356.625
## 2 Administrative Assistant                      7108.740
## 3      Computer Programmer                      9547.729
## 4                Executive                     11032.300
## 5       Sales - Commission                      8523.407
## 6                  Teacher                      7782.451
## # ... with 2 more variables: median_ProsperPrincipalBorrowed <dbl>,
## #   n <int>

Executive has the highest mean (about 11,032), with Administrative Assistant the lowest (about 7,109).

# ProsperPrincipalBorrowed by ProsperScore
bplt_ProsperPrincipalBorrowed_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = ProsperPrincipalBorrowed), data = subset(pld.ana, !is.na(ProsperPrincipalBorrowed) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 22000))
bplt_ProsperPrincipalBorrowed_by_ProsperScore

pld.ProsperPrincipalBorrowed_by_ProsperScore <- subset(pld.ana, !is.na(ProsperPrincipalBorrowed)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_ProsperPrincipalBorrowed = mean(ProsperPrincipalBorrowed),
            median_ProsperPrincipalBorrowed = median(ProsperPrincipalBorrowed),
            n = n()) %>%
  arrange(ProsperScore)
pld.ProsperPrincipalBorrowed_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_ProsperPrincipalBorrowed
##          <fctr>                         <dbl>
## 1             1                      6698.976
## 2             2                      7436.541
## 3             3                      7369.556
## 4             4                      7669.675
## 5             5                      7765.790
## 6             6                      8011.056
## 7             7                      8696.934
## 8             8                      9117.895
## 9             9                      9321.669
## 10           10                     11832.318
## 11           11                     14536.176
## 12           NA                      6012.382
## # ... with 2 more variables: median_ProsperPrincipalBorrowed <dbl>,
## #   n <int>

In contrast to the previous trends, the principal borrowed increases as the score increases: the mean at score 11 (about 14,536) is roughly twice the mean at score 1 (about 6,699).
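The size of the gap between score 11 and score 1 can be checked from the two means printed in the table above:

```r
# Mean ProsperPrincipalBorrowed for scores 1 and 11, copied from the table.
mean_score_1 <- 6698.976
mean_score_11 <- 14536.176

# Ratio of the two group means.
ratio <- mean_score_11 / mean_score_1
print(round(ratio, 2))
```

The ratio comes out a little above 2, which fits the "two times" description.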

# ProsperPrincipalOutstanding by Occupation
bplt_ProsperPrincipalOutstanding_by_Occupation <- ggplot(aes(x = Occupation, y = ProsperPrincipalOutstanding), data = subset(pld.Occupation, !is.na(ProsperPrincipalOutstanding))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 7500))
bplt_ProsperPrincipalOutstanding_by_Occupation

pld.Occupation.ProsperPrincipalOutstanding_by_Occupation <- subset(pld.Occupation, !is.na(ProsperPrincipalOutstanding)) %>%
  group_by(Occupation) %>%
  summarise(mean_ProsperPrincipalOutstanding = mean(ProsperPrincipalOutstanding), 
            median_ProsperPrincipalOutstanding = median(ProsperPrincipalOutstanding),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.ProsperPrincipalOutstanding_by_Occupation
## # A tibble: 6 x 4
##                 Occupation mean_ProsperPrincipalOutstanding
##                     <fctr>                            <dbl>
## 1           Accountant/CPA                         3179.139
## 2 Administrative Assistant                         2750.100
## 3      Computer Programmer                         2721.382
## 4                Executive                         4171.374
## 5       Sales - Commission                         3040.752
## 6                  Teacher                         2968.234
## # ... with 2 more variables: median_ProsperPrincipalOutstanding <dbl>,
## #   n <int>

As usual, Executive has the highest median, and here its values are much higher than those of the other occupations, which are similar to one another.

# ProsperPrincipalOutstanding by ProsperScore
bplt_ProsperPrincipalOutstanding_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = ProsperPrincipalOutstanding), data = subset(pld.ana, !is.na(ProsperPrincipalOutstanding) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 8000))
bplt_ProsperPrincipalOutstanding_by_ProsperScore

pld.ProsperPrincipalOutstanding_by_ProsperScore <- subset(pld.ana, !is.na(ProsperPrincipalOutstanding)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_ProsperPrincipalOutstanding = mean(ProsperPrincipalOutstanding),
            median_ProsperPrincipalOutstanding = median(ProsperPrincipalOutstanding),
            n = n()) %>%
  arrange(ProsperScore)
pld.ProsperPrincipalOutstanding_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_ProsperPrincipalOutstanding
##          <fctr>                            <dbl>
## 1             1                         2153.677
## 2             2                         2999.903
## 3             3                         2925.260
## 4             4                         2827.027
## 5             5                         2892.634
## 6             6                         2777.285
## 7             7                         3069.913
## 8             8                         2867.231
## 9             9                         3009.148
## 10           10                         3024.799
## 11           11                         3590.115
## 12           NA                         3027.456
## # ... with 2 more variables: median_ProsperPrincipalOutstanding <dbl>,
## #   n <int>

# LoanOriginalAmount by Occupation
bplt_LoanOriginalAmount_by_Occupation <- ggplot(aes(x = Occupation, y = LoanOriginalAmount), data = subset(pld.Occupation, !is.na(LoanOriginalAmount))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 18000))
bplt_LoanOriginalAmount_by_Occupation

pld.Occupation.LoanOriginalAmount_by_Occupation <- subset(pld.Occupation, !is.na(LoanOriginalAmount)) %>%
  group_by(Occupation) %>%
  summarise(mean_LoanOriginalAmount = mean(LoanOriginalAmount),
            median_LoanOriginalAmount = median(LoanOriginalAmount),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.LoanOriginalAmount_by_Occupation
## # A tibble: 6 x 4
##                 Occupation mean_LoanOriginalAmount
##                     <fctr>                   <dbl>
## 1           Accountant/CPA                9195.888
## 2 Administrative Assistant                6598.894
## 3      Computer Programmer                9420.892
## 4                Executive               11890.577
## 5       Sales - Commission                8763.208
## 6                  Teacher                7887.450
## # ... with 2 more variables: median_LoanOriginalAmount <dbl>, n <int>

Executive has the highest values with Administrative Assistant the lowest.

# LoanOriginalAmount by ProsperScore
bplt_LoanOriginalAmount_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = LoanOriginalAmount), data = subset(pld.ana, !is.na(LoanOriginalAmount) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 20000))
bplt_LoanOriginalAmount_by_ProsperScore

pld.LoanOriginalAmount_by_ProsperScore <- subset(pld.ana, !is.na(LoanOriginalAmount)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_LoanOriginalAmount = mean(LoanOriginalAmount),
            median_LoanOriginalAmount = median(LoanOriginalAmount),
            n = n()) %>%
  arrange(ProsperScore)
pld.LoanOriginalAmount_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_LoanOriginalAmount median_LoanOriginalAmount     n
##          <fctr>                   <dbl>                     <dbl> <int>
## 1             1                4570.955                      4000   992
## 2             2                5279.778                      4000  5766
## 3             3                7062.552                      4500  7642
## 4             4                8401.920                      7500 12595
## 5             5                8400.081                      7000  9813
## 6             6                9222.604                      8000 12278
## 7             7               10097.153                      9500 10597
## 8             8               10487.978                     10000 12053
## 9             9               10055.976                      8300  6911
## 10           10               11742.895                     10000  4750
## 11           11               14858.186                     15000  1456
## 12           NA                6159.303                      4500 29084

There is a huge gap between score 1 and score 11. The original amount rises quickly with the score, so that the mean at score 11 is more than three times the mean at score 1.

# MonthlyLoanPayment by Occupation
bplt_MonthlyLoanPayment_by_Occupation <- ggplot(aes(x = Occupation, y = MonthlyLoanPayment), data = subset(pld.Occupation, !is.na(MonthlyLoanPayment))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 600))
bplt_MonthlyLoanPayment_by_Occupation

pld.Occupation.MonthlyLoanPayment_by_Occupation <- subset(pld.Occupation, !is.na(MonthlyLoanPayment)) %>%
  group_by(Occupation) %>%
  summarise(mean_MonthlyLoanPayment = mean(MonthlyLoanPayment),
            median_MonthlyLoanPayment = median(MonthlyLoanPayment),
            n = n()) %>%
  arrange(Occupation)
pld.Occupation.MonthlyLoanPayment_by_Occupation
## # A tibble: 6 x 4
##                 Occupation mean_MonthlyLoanPayment
##                     <fctr>                   <dbl>
## 1           Accountant/CPA                297.6405
## 2 Administrative Assistant                224.3001
## 3      Computer Programmer                306.5021
## 4                Executive                378.6373
## 5       Sales - Commission                287.7390
## 6                  Teacher                255.0617
## # ... with 2 more variables: median_MonthlyLoanPayment <dbl>, n <int>

As above, Executive has the highest values with Administrative Assistant the lowest.

# MonthlyLoanPayment by ProsperScore
bplt_MonthlyLoanPayment_by_ProsperScore <- ggplot(aes(x = ProsperScore, y = MonthlyLoanPayment), data = subset(pld.ana, !is.na(MonthlyLoanPayment) & !is.na(ProsperScore))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 600))
bplt_MonthlyLoanPayment_by_ProsperScore

pld.MonthlyLoanPayment_by_ProsperScore <- subset(pld.ana, !is.na(MonthlyLoanPayment)) %>%
  group_by(ProsperScore) %>%
  summarise(mean_MonthlyLoanPayment = mean(MonthlyLoanPayment),
            median_MonthlyLoanPayment = median(MonthlyLoanPayment),
            n = n()) %>%
  arrange(ProsperScore)
pld.MonthlyLoanPayment_by_ProsperScore
## # A tibble: 12 x 4
##    ProsperScore mean_MonthlyLoanPayment median_MonthlyLoanPayment     n
##          <fctr>                   <dbl>                     <dbl> <int>
## 1             1                194.5868                   171.410   992
## 2             2                201.7637                   166.540  5766
## 3             3                251.5436                   174.200  7642
## 4             4                283.4173                   252.670 12595
## 5             5                282.8980                   246.410  9813
## 6             6                299.7858                   270.425 12278
## 7             7                316.4242                   290.180 10597
## 8             8                319.9872                   287.110 12053
## 9             9                295.2916                   251.700  6911
## 10           10                336.3999                   309.120  4750
## 11           11                424.0375                   402.315  1456
## 12           NA                215.7157                   153.800 29084

As with LoanOriginalAmount, there is a large gap between score 1 and score 11: the mean monthly payment at score 11 is a little more than twice that at score 1.

In summary, Administrative Assistant and Executive are the occupations that stand out most (usually with the lowest and highest values respectively), and the variables generally change monotonically, either upward or downward, with the Prosper score.
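The summaries in this section all repeat the same pattern: drop the NA values of one variable, group by Occupation or ProsperScore, then take the mean, median and count. A minimal sketch of that pattern as one reusable function is given below. It is written in base R so that it runs without extra packages (the analysis itself uses dplyr), the function name `summarise_by` and the data frame `toy` are hypothetical, and unlike the tables above it also drops rows whose group is NA.

```r
# Hypothetical helper (not part of the original analysis): mean, median and
# count of one numeric column per level of a grouping column.
summarise_by <- function(data, value, group) {
  # Keep only rows where both the value and the group are present
  keep <- !is.na(data[[value]]) & !is.na(data[[group]])
  v <- data[[value]][keep]
  g <- factor(data[[group]][keep])   # factor() also drops unused levels
  data.frame(
    group  = levels(g),
    mean   = as.vector(tapply(v, g, mean)),
    median = as.vector(tapply(v, g, median)),
    n      = as.vector(tapply(v, g, length)),
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for pld.ana (values are made up)
toy <- data.frame(
  ProsperScore       = c(1, 1, 11, 11, 11, NA),
  LoanOriginalAmount = c(4000, 5000, 15000, 14000, NA, 4500)
)
res <- summarise_by(toy, "LoanOriginalAmount", "ProsperScore")
res
```

On the real data the call would look like `summarise_by(pld.ana, "MonthlyLoanPayment", "ProsperScore")`, assuming `pld.ana` is loaded as in the Dataset section.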

Bivariate Analysis

# scatterplot of TotalCreditLinespast7years and EstimatedReturn
ggplot(aes(x = TotalCreditLinespast7years, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalCreditLinespast7years))) +
  geom_point(alpha = 1/20, position = "jitter") +
  geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1 ) +
  geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
  stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)

# correlation between TotalCreditLinespast7years and EstimatedReturn
cor.test(pld$TotalCreditLinespast7years, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$TotalCreditLinespast7years and pld$EstimatedReturn
## t = -10.59, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04304943 -0.02961027
## sample estimates:
##         cor 
## -0.03633149

The correlation is -0.036, so there is almost no linear relationship between EstimatedReturn and TotalCreditLinespast7years. The scatterplot agrees: the mean, median and quantile lines are almost parallel to the x-axis. Beyond about 75 credit lines, the lines become noisier than before.
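The very small p-value does not contradict the weak correlation: with roughly 85,000 complete pairs, even a tiny r produces a large test statistic. For Pearson's test the statistic is t = r * sqrt(df) / sqrt(1 - r^2), which can be checked against the output above. A minimal sketch using the printed values (the variable names are only illustrative):

```r
r  <- -0.03633149   # sample correlation from the output above
df <- 84851         # degrees of freedom (number of complete pairs minus 2)

# t statistic of the Pearson correlation test
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 2)

# Share of the variance in EstimatedReturn accounted for by a linear fit
r_squared <- r^2
round(r_squared, 4)
```

The statistic matches the reported t of about -10.59, while r^2 shows that the linear fit accounts for only about 0.1% of the variance, so "significant" here should not be read as "strong".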

# scatterplot of TotalInquiries and EstimatedReturn
ggplot(aes(x = TotalInquiries, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalInquiries))) +
  geom_point(alpha = 1/20, position = "jitter") +
  geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
  geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
  stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)

# correlation between TotalInquiries and EstimatedReturn
cor.test(pld$TotalInquiries, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$TotalInquiries and pld$EstimatedReturn
## t = 24.491, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07709580 0.09045827
## sample estimates:
##       cor 
## 0.0837808

The correlation here is 0.084, which is also very small but positive. The linear relationship between EstimatedReturn and TotalInquiries is weak, but its direction is opposite to that between EstimatedReturn and TotalCreditLinespast7years. In addition, there is a big gap at some point in the plot.

# scatterplot of DelinquenciesLast7Years and EstimatedReturn
ggplot(aes(x = DelinquenciesLast7Years, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DelinquenciesLast7Years))) +
  geom_point(alpha = 1/20, position = "jitter") +
  geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
  geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
  stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1) 

# correlation between DelinquenciesLast7Years and EstimatedReturn
cor.test(pld$DelinquenciesLast7Years, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$DelinquenciesLast7Years and pld$EstimatedReturn
## t = 27.632, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08776313 0.10110004
## sample estimates:
##        cor 
## 0.09443582

Most of the points here are concentrated at x = 0, where the number of delinquencies is equal to 0. The linear relationship is again weak (correlation 0.094), but the points are spread more evenly than in the previous two plots.

# scatterplot of TotalTrades and EstimatedReturn
ggplot(aes(x = TotalTrades, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalTrades))) +
  geom_point(alpha = 1/20, position = "jitter") +
  geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
  geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
  stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1) 

# correlation between TotalTrades and EstimatedReturn
cor.test(pld$TotalTrades, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$TotalTrades and pld$EstimatedReturn
## t = -19.168, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07235914 -0.05896024
## sample estimates:
##         cor 
## -0.06566265

The relationship between EstimatedReturn and TotalTrades is negative but weak (correlation -0.066), and the points are mostly concentrated in the area where the return is between 0.05 and 0.13 and the number of trades is between 6 and 37.

# scatterplot of DebtToIncomeRatio and EstimatedReturn
ggplot(aes(x = DebtToIncomeRatio, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DebtToIncomeRatio))) +
  geom_point(alpha = 1/20, position = "jitter") +
  geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
  geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
  stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)

# correlation between DebtToIncomeRatio and EstimatedReturn
cor.test(pld$DebtToIncomeRatio, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$DebtToIncomeRatio and pld$EstimatedReturn
## t = 24.387, df = 77555, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08024760 0.09421615
## sample estimates:
##        cor 
## 0.08723617

The points are concentrated in a small area, and the two quantile lines slope upward, meaning that the two variables are positively related. The correlation is 0.087, which is small, so the linear relationship is weak.

# scatterplot of TotalProsperPaymentsBilled and EstimatedReturn
ggplot(aes(x = TotalProsperPaymentsBilled, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalProsperPaymentsBilled))) +
  geom_point(alpha = 1/20, position = "jitter") +
  geom_line(stat = "summary", fun.y = mean, color = "blue", lwd = 1) +
  geom_line(stat = "summary", fun.y = median, color = "orange", linetype = 2, lwd = 1)+
  stat_quantile(quantiles = c(0.25, 0.75), color = "green", linetype = 3, lwd = 1)  

# correlation between TotalProsperPaymentsBilled and EstimatedReturn
cor.test(pld$TotalProsperPaymentsBilled, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$TotalProsperPaymentsBilled and pld$EstimatedReturn
## t = -3.6879, df = 19795, p-value = 0.0002267
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04011842 -0.01227741
## sample estimates:
##       cor 
## -0.026203

The points here are not as concentrated as above, and the lines are almost horizontal. As the correlation test shows, the correlation between EstimatedReturn and TotalProsperPaymentsBilled is -0.026, so the two are hardly related.

# scatterplot of ProsperPrincipalBorrowed and EstimatedReturn
ggplot(aes(x = log(ProsperPrincipalBorrowed+1), y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalBorrowed))) +
  geom_point(alpha = 1/20, position = "jitter") +
  stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 1)

# correlation between ProsperPrincipalBorrowed and EstimatedReturn
cor.test(pld$ProsperPrincipalBorrowed, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$ProsperPrincipalBorrowed and pld$EstimatedReturn
## t = -23.885, df = 19795, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1808775 -0.1537977
## sample estimates:
##        cor 
## -0.1673692

Compared to the other variables so far, EstimatedReturn and ProsperPrincipalBorrowed are more strongly related, with a correlation of -0.167.
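Note that the scatterplot uses log(ProsperPrincipalBorrowed + 1) on the x-axis, while cor.test() is run on the untransformed values. Pearson's r generally changes under such a transformation, whereas a rank correlation (Spearman) does not. The sketch below illustrates this on simulated data, not on the Prosper data; all names and numbers in it are made up.

```r
set.seed(1)
x <- rexp(500, rate = 1 / 5000)                      # skewed, amount-like variable
y <- 0.2 - 0.01 * log1p(x) + rnorm(500, sd = 0.02)   # return-like variable

pearson_raw  <- cor(x, y)                            # same kind of measure as cor.test() above
pearson_log  <- cor(log1p(x), y)                     # matches a log(x + 1) axis
spearman_raw <- cor(x, y, method = "spearman")
spearman_log <- cor(log1p(x), y, method = "spearman")

c(pearson_raw = pearson_raw, pearson_log = pearson_log,
  spearman_raw = spearman_raw, spearman_log = spearman_log)
```

If the plotted (log-scale) relationship is the one of interest, it may be worth reporting the correlation on the same scale, or a rank correlation via `cor.test(..., method = "spearman")`.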

# scatterplot of ProsperPrincipalOutstanding and EstimatedReturn
ggplot(aes(x = log(ProsperPrincipalOutstanding+1), y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalOutstanding))) +
  geom_point(alpha = 1/20, position = "jitter") +
  stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 1)

# correlation between ProsperPrincipalOutstanding and EstimatedReturn
cor.test(pld$ProsperPrincipalOutstanding, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$ProsperPrincipalOutstanding and pld$EstimatedReturn
## t = -6.8638, df = 19795, p-value = 6.905e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06261448 -0.03482048
## sample estimates:
##         cor 
## -0.04872691

However, the relationship between EstimatedReturn and ProsperPrincipalOutstanding is weak (correlation -0.049), and most of the points lie on the line x = 0.

# scatterplot of LoanOriginalAmount and EstimatedReturn
ggplot(aes(x = log(LoanOriginalAmount+1), y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
  geom_point(alpha = 1/20, position = "jitter") +
  stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 1)

# correlation between LoanOriginalAmount and EstimatedReturn
cor.test(pld$LoanOriginalAmount, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$LoanOriginalAmount and pld$EstimatedReturn
## t = -86.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2922833 -0.2799279
## sample estimates:
##        cor 
## -0.2861175

As with ProsperPrincipalBorrowed, the relationship between EstimatedReturn and LoanOriginalAmount is negative and relatively strong (correlation -0.286).

# scatterplot of MonthlyLoanPayment and EstimatedReturn
ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn), data = subset(pld, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
  geom_point(alpha = 1/20, position = "jitter") +
  stat_quantile(quantiles = c(0.25, 0.5, 0.75), aes(color = ..quantile..), lwd = 2) +
  scale_x_continuous(limits = c(0, 1500), breaks = seq(0, 1500, 250))

# correlation between MonthlyLoanPayment and EstimatedReturn
cor.test(pld$MonthlyLoanPayment, pld$EstimatedReturn, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pld$MonthlyLoanPayment and pld$EstimatedReturn
## t = -76.089, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2590193 -0.2464218
## sample estimates:
##        cor 
## -0.2527313

The relationship between EstimatedReturn and MonthlyLoanPayment is approximately the same as that between EstimatedReturn and LoanOriginalAmount: negative and relatively strong (correlation -0.253).
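To compare the ten correlations side by side, they can be collected and ranked by absolute value. The sketch below copies the estimates from the cor.test() outputs above; the object names `r` and `ranked` are only illustrative.

```r
# Pearson correlations with EstimatedReturn, copied from the outputs above
r <- c(
  TotalCreditLinespast7years  = -0.03633149,
  TotalInquiries              =  0.0837808,
  DelinquenciesLast7Years     =  0.09443582,
  TotalTrades                 = -0.06566265,
  DebtToIncomeRatio           =  0.08723617,
  TotalProsperPaymentsBilled  = -0.026203,
  ProsperPrincipalBorrowed    = -0.1673692,
  ProsperPrincipalOutstanding = -0.04872691,
  LoanOriginalAmount          = -0.2861175,
  MonthlyLoanPayment          = -0.2527313
)

# Rank by strength of the linear relationship; r^2 is the share of variance
# in EstimatedReturn accounted for by a straight-line fit
ranked <- r[order(abs(r), decreasing = TRUE)]
data.frame(r = round(ranked, 3), r_squared = round(ranked^2, 3))
```

LoanOriginalAmount and MonthlyLoanPayment come out on top, but even the strongest of them accounts for only about 8% of the variance, so "relatively strong" is meant in comparison with the other variables here.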

Multivariate Analysis

# line plot of TotalCreditLinespast7years and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalCreditLinespast7years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalCreditLinespast7years))) +
  geom_line()+
  geom_smooth()

After adding IsBorrowerHomeowner as another variable to the relationship between EstimatedReturn and TotalCreditLinespast7years, we can see that up to some point homeowners have a slightly lower return than non-homeowners, but beyond it the return for homeowners rises sharply, so that there is a clear difference between the two.

# line plot of TotalInquiries and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalInquiries, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalInquiries))) +
  geom_line()+
  geom_smooth() +
  ylim(0, 0.2)

# line plot of DelinquenciesLast7Years and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = DelinquenciesLast7Years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DelinquenciesLast7Years))) +
  geom_line() +
  geom_smooth()

# line plot of TotalTrades and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalTrades, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalTrades))) +
  geom_line() +
  geom_smooth()

The same trend appears here as in the plot of EstimatedReturn against TotalCreditLinespast7years, except that the gap at the end is bigger in this plot.

# line plot of DebtToIncomeRatio and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = DebtToIncomeRatio, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(DebtToIncomeRatio))) +
  geom_line() +
  geom_smooth()

# line plot of TotalProsperPaymentsBilled and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = TotalProsperPaymentsBilled, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(TotalProsperPaymentsBilled))) +
  geom_line() +
  geom_smooth()

The red line (non-homeowners) is noisier, which could mean that non-homeowners are less consistent than homeowners in paying on time.

# line plot of ProsperPrincipalBorrowed and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = ProsperPrincipalBorrowed, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalBorrowed))) +
  geom_line() +
  geom_smooth()

# line plot of ProsperPrincipalOutstanding and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = ProsperPrincipalOutstanding, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalOutstanding))) +
  geom_line() +
  geom_smooth()

# line plot of LoanOriginalAmount and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = LoanOriginalAmount, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
  geom_line() +
  geom_smooth()

# line plot of MonthlyLoanPayment and EstimatedReturn by IsBorrowerHomeowner, with a smooth line added
ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
  geom_line() +
  geom_smooth()

In general, whether or not the borrower is a homeowner does not have much influence on the relationship between EstimatedReturn and the other variables.
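Because geom_line() without a summary connects every observation, much of the visible noise in these plots comes from individual loans rather than from the two groups themselves. One way to compare the groups more directly is to take the median return at each x value per group, which is what geom_line(stat = "summary", fun.y = median) draws in the faceted plots that follow. A minimal base R sketch on made-up data (the data frame `toy` is hypothetical):

```r
toy <- data.frame(
  TotalTrades         = c(10, 10, 10, 10, 20, 20, 20, 20),
  IsBorrowerHomeowner = c("False", "False", "True", "True",
                          "False", "False", "True", "True"),
  EstimatedReturn     = c(0.10, 0.12, 0.09, 0.11, 0.08, 0.10, 0.07, 0.08)
)

# Median return for every combination of x value and homeowner status
med <- aggregate(EstimatedReturn ~ TotalTrades + IsBorrowerHomeowner,
                 data = toy, FUN = median)
med
```

Each row of `med` is one point on a per-group median line, so the two groups can be compared at the same x value without the scatter of single observations.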

# another two variables Occupation and IncomeRange added on TotalCreditLinespast7years and EstimatedReturn
ggplot(aes(x = TotalCreditLinespast7years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalCreditLinespast7years)))+
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on TotalInquiries and EstimatedReturn
ggplot(aes(x = TotalInquiries, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalInquiries))) +
  geom_line(stat = "summary", fun.y = median)+
  geom_smooth() +
  ylim(0, 0.2) +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on DelinquenciesLast7Years and EstimatedReturn
ggplot(aes(x = DelinquenciesLast7Years, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(DelinquenciesLast7Years))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on TotalTrades and EstimatedReturn
ggplot(aes(x = TotalTrades, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalTrades))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on DebtToIncomeRatio and EstimatedReturn
ggplot(aes(x = DebtToIncomeRatio, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(DebtToIncomeRatio))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on TotalProsperPaymentsBilled and EstimatedReturn
ggplot(aes(x = TotalProsperPaymentsBilled, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(TotalProsperPaymentsBilled))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on ProsperPrincipalBorrowed and EstimatedReturn
ggplot(aes(x = ProsperPrincipalBorrowed, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalBorrowed))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on ProsperPrincipalOutstanding and EstimatedReturn
ggplot(aes(x = ProsperPrincipalOutstanding, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(ProsperPrincipalOutstanding))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

# another two variables Occupation and IncomeRange added on LoanOriginalAmount and EstimatedReturn
ggplot(aes(x = LoanOriginalAmount, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation  ~ IncomeRange)

# another two variables Occupation and IncomeRange added on MonthlyLoanPayment and EstimatedReturn
ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld.Occupation, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
  geom_line(stat = "summary", fun.y = median) +
  geom_smooth() +
  facet_grid(Occupation ~ IncomeRange)

We can see from these plots that the IncomeRange of $25,000-49,999 is the quietest range, especially compared to the range of $50,000-74,999; on the other hand, the occupations of Administrative Assistant and Executive are the noisiest.

Let’s look at some of the plots again.

Last Three Plots

ggplot(aes(x = IncomeRange, y = EstimatedReturn), data = pld.IncomeRange) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0.05, 0.15))

The IncomeRange of $1-24,999 has the highest median return, but it has far fewer samples than the range of $25,000-49,999, which has the most samples in the dataset. Ignoring the $1-24,999 range, the range of $25,000-49,999 has the highest median return.

ggplot(aes(x = Occupation, y = EstimatedReturn), data = subset(pld.Occupation, !is.na(EstimatedReturn))) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0.05, 0.15))

Administrative Assistant has the highest median return.

p1 <- ggplot(aes(x = LoanOriginalAmount, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(LoanOriginalAmount))) +
  geom_line() +
  geom_smooth()

p2 <- ggplot(aes(x = MonthlyLoanPayment, y = EstimatedReturn, color = IsBorrowerHomeowner), data = subset(pld, !is.na(EstimatedReturn) & !is.na(MonthlyLoanPayment))) +
  geom_line() +
  geom_smooth()

grid.arrange(p1, p2, ncol = 2)

There is no big difference between homeowners and non-homeowners, but there is a relatively strong negative relationship between EstimatedReturn and LoanOriginalAmount, and between EstimatedReturn and MonthlyLoanPayment.

Taking these plots together, two groups stand out: the IncomeRange of $25,000-49,999 has the most stable return, whereas the Occupation of Administrative Assistant has the most unstable return. At the same time, whether the borrower is a homeowner also plays a part.
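"Stable" and "unstable" are judged by eye from the plots. One way to put a number on this is the interquartile range of EstimatedReturn within each group, shown next to the group size. The sketch below uses simulated data; `toy` and its values are made up, and on the real data the same calls would presumably be applied to pld.IncomeRange or pld.Occupation.

```r
set.seed(42)
toy <- data.frame(
  IncomeRange     = rep(c("$1-24,999", "$25,000-49,999"), times = c(50, 500)),
  EstimatedReturn = c(rnorm(50,  mean = 0.11, sd = 0.04),
                      rnorm(500, mean = 0.10, sd = 0.02)),
  stringsAsFactors = FALSE
)

# Median, interquartile range and sample size per income range
med <- tapply(toy$EstimatedReturn, toy$IncomeRange, median)
spread <- data.frame(
  IncomeRange = names(med),
  median      = as.vector(med),
  iqr         = as.vector(tapply(toy$EstimatedReturn, toy$IncomeRange, IQR)),
  n           = as.vector(tapply(toy$EstimatedReturn, toy$IncomeRange, length)),
  stringsAsFactors = FALSE
)
spread
```

A smaller IQR indicates a more stable return, and the n column shows how much data each figure rests on, which matters for small groups such as the $1-24,999 range.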